Spotify Dataset Exploratory Data Analysis (EDA)¶

Introduction¶

In this kernel, I explore how audio features and other song attributes relate to popularity. I also analyze how specific song characteristics (such as song length and sound features) have changed over time.

Import Libraries¶

In [1]:
import os
import numpy as np
import pandas as pd
pd.plotting.register_matplotlib_converters()

import seaborn as sns
import plotly.express as px 
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.metrics import euclidean_distances
from scipy.spatial.distance import cdist

import warnings
warnings.filterwarnings("ignore")

Import Data¶

In [2]:
data = pd.read_csv("/Users/vaishnavi/Downloads/data/data.csv")
genre_data = pd.read_csv("/Users/vaishnavi/Downloads/data/data_by_genres.csv")
year_data = pd.read_csv("/Users/vaishnavi/Downloads/data/data_by_year.csv")

Read Data¶

The .info() method shows the data type and non-null count of each column, which reveals any missing values in the three datasets. None of the datasets has missing values, so no rows or columns have to be dropped. The .head() and .tail() methods return the first 5 and last 5 rows, respectively.

In [3]:
print(data.info())
print(genre_data.info())
print(year_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170653 entries, 0 to 170652
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   valence           170653 non-null  float64
 1   year              170653 non-null  int64  
 2   acousticness      170653 non-null  float64
 3   artists           170653 non-null  object 
 4   danceability      170653 non-null  float64
 5   duration_ms       170653 non-null  int64  
 6   energy            170653 non-null  float64
 7   explicit          170653 non-null  int64  
 8   id                170653 non-null  object 
 9   instrumentalness  170653 non-null  float64
 10  key               170653 non-null  int64  
 11  liveness          170653 non-null  float64
 12  loudness          170653 non-null  float64
 13  mode              170653 non-null  int64  
 14  name              170653 non-null  object 
 15  popularity        170653 non-null  int64  
 16  release_date      170653 non-null  object 
 17  speechiness       170653 non-null  float64
 18  tempo             170653 non-null  float64
dtypes: float64(9), int64(6), object(4)
memory usage: 24.7+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2973 entries, 0 to 2972
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   mode              2973 non-null   int64  
 1   genres            2973 non-null   object 
 2   acousticness      2973 non-null   float64
 3   danceability      2973 non-null   float64
 4   duration_ms       2973 non-null   float64
 5   energy            2973 non-null   float64
 6   instrumentalness  2973 non-null   float64
 7   liveness          2973 non-null   float64
 8   loudness          2973 non-null   float64
 9   speechiness       2973 non-null   float64
 10  tempo             2973 non-null   float64
 11  valence           2973 non-null   float64
 12  popularity        2973 non-null   float64
 13  key               2973 non-null   int64  
dtypes: float64(11), int64(2), object(1)
memory usage: 325.3+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   mode              100 non-null    int64  
 1   year              100 non-null    int64  
 2   acousticness      100 non-null    float64
 3   danceability      100 non-null    float64
 4   duration_ms       100 non-null    float64
 5   energy            100 non-null    float64
 6   instrumentalness  100 non-null    float64
 7   liveness          100 non-null    float64
 8   loudness          100 non-null    float64
 9   speechiness       100 non-null    float64
 10  tempo             100 non-null    float64
 11  valence           100 non-null    float64
 12  popularity        100 non-null    float64
 13  key               100 non-null    int64  
dtypes: float64(11), int64(3)
memory usage: 11.1 KB
None
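
The missing-value check read off the .info() output can also be made explicit. The sketch below is a minimal illustration on a made-up data frame; `toy` and the helper `missing_summary` are hypothetical names, not part of the original notebook, and the same helper could be applied to `data`, `genre_data`, or `year_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one of the datasets (values are made up for illustration)
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "popularity": [10, 55, np.nan, 70],
    "energy": [0.2, 0.8, 0.5, 0.9],
})

def missing_summary(df):
    """Return the count and share of missing values for each column."""
    counts = df.isna().sum()
    return pd.DataFrame({"n_missing": counts, "share": counts / len(df)})

summary = missing_summary(toy)
print(summary)
```

A column with `n_missing` equal to 0 for every dataset would confirm what the non-null counts above already suggest.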
In [4]:
data.head()
Out[4]:
valence year acousticness artists danceability duration_ms energy explicit id instrumentalness key liveness loudness mode name popularity release_date speechiness tempo
0 0.0594 1921 0.982 ['Sergei Rachmaninoff', 'James Levine', 'Berli... 0.279 831667 0.211 0 4BJqT0PrAfrxzMOxytFOIz 0.878000 10 0.665 -20.096 1 Piano Concerto No. 3 in D Minor, Op. 30: III. ... 4 1921 0.0366 80.954
1 0.9630 1921 0.732 ['Dennis Day'] 0.819 180533 0.341 0 7xPhfUan2yNtyFG0cUWkt8 0.000000 7 0.160 -12.441 1 Clancy Lowered the Boom 5 1921 0.4150 60.936
2 0.0394 1921 0.961 ['KHP Kridhamardawa Karaton Ngayogyakarta Hadi... 0.328 500062 0.166 0 1o6I8BglA6ylDMrIELygv1 0.913000 3 0.101 -14.850 1 Gati Bali 5 1921 0.0339 110.339
3 0.1650 1921 0.967 ['Frank Parker'] 0.275 210000 0.309 0 3ftBPsC5vPBKxYSee08FDH 0.000028 5 0.381 -9.316 1 Danny Boy 3 1921 0.0354 100.109
4 0.2530 1921 0.957 ['Phil Regan'] 0.418 166693 0.193 0 4d6HGyGT8e121BsdKmw9v6 0.000002 3 0.229 -10.096 1 When Irish Eyes Are Smiling 2 1921 0.0380 101.665
In [5]:
data.tail()
Out[5]:
valence year acousticness artists danceability duration_ms energy explicit id instrumentalness key liveness loudness mode name popularity release_date speechiness tempo
170648 0.608 2020 0.08460 ['Anuel AA', 'Daddy Yankee', 'KAROL G', 'Ozuna... 0.786 301714 0.808 0 0KkIkfsLEJbrcIhYsCL7L5 0.000289 7 0.0822 -3.702 1 China 72 2020-05-29 0.0881 105.029
170649 0.734 2020 0.20600 ['Ashnikko'] 0.717 150654 0.753 0 0OStKKAuXlxA0fMH54Qs6E 0.000000 7 0.1010 -6.020 1 Halloweenie III: Seven Days 68 2020-10-23 0.0605 137.936
170650 0.637 2020 0.10100 ['MAMAMOO'] 0.634 211280 0.858 0 4BZXVFYCb76Q0Klojq4piV 0.000009 4 0.2580 -2.226 0 AYA 76 2020-11-03 0.0809 91.688
170651 0.195 2020 0.00998 ['Eminem'] 0.671 337147 0.623 1 5SiZJoLXp3WOl3J4C8IK0d 0.000008 2 0.6430 -7.161 1 Darkness 70 2020-01-17 0.3080 75.055
170652 0.642 2020 0.13200 ['KEVVO', 'J Balvin'] 0.856 189507 0.721 1 7HmnJHfs0BkFzX4x8j0hkl 0.004710 7 0.1820 -4.928 1 Billetes Azules (with J Balvin) 74 2020-10-16 0.1080 94.991

Understanding the Data through Visualization¶

To get a general sense of music production over the years, the count plot below shows the number of songs per decade.

In [6]:
# Return the decade label (e.g. "1990s") for a given year
def decade_of_year(year):
    # Floor division avoids the round trip through float that int(year/10) takes
    period = (year // 10) * 10
    return str(period) + "s"
In [7]:
# Add a column to the dataset with the decade corresponding to each year
data['year_to_decade'] = data['year'].apply(decade_of_year)

# Set the width and height of the figure
plt.figure(figsize=(10,6))

# Add title
plt.title("Count of Songs by Decade")

# Count chart showing counts of songs released every decade since the 1920s
sns.countplot(x=data['year_to_decade'])

# Add label for vertical axis
plt.ylabel("Count of Songs")
Out[7]:
Text(0, 0.5, 'Count of Songs')

The number of songs in the dataset grows steadily from the 1920s to the 1950s and levels off afterwards. The count for the 2020s is much lower than for the other decades because the dataset only extends to 2020.
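
The counts behind the plot can also be tabulated directly. The following is a small sketch on made-up years, with `toy` and `decade_counts` as illustrative names; it derives the decade label with vectorized floor division instead of `apply`, which is an alternative to the function above, not the notebook's own method.

```python
import pandas as pd

# Made-up release years, only to illustrate the computation
toy = pd.DataFrame({"year": [1921, 1929, 1930, 1958, 2019, 2020]})

# Floor division works on the whole column at once: 1929 -> 1920 -> "1920s"
toy["year_to_decade"] = (toy["year"] // 10 * 10).astype(str) + "s"

# Number of songs per decade, ordered chronologically
decade_counts = toy["year_to_decade"].value_counts().sort_index()
print(decade_counts)
```

On the real data, `data['year_to_decade'].value_counts().sort_index()` would give the exact bar heights of the count plot.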


The enhanced box plot below checks for any strong outliers in popularity when grouped by decades.

In [8]:
sns.boxenplot(data=data, x="popularity", y="year_to_decade")
Out[8]:
<AxesSubplot:xlabel='popularity', ylabel='year_to_decade'>

There are visible strong outliers. Those on the far right lie well above the median and the upper quantiles and can be interpreted as the hits of their decades. Outliers on the far left of the enhanced box plots are songs that never picked up and differ markedly from the rest of the songs in their decade. The exploration and visualization below aim to explain this wide variation in popularity.
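
To list the outlying songs rather than only see them, one option is to flag them per decade. The sketch below uses the conventional 1.5 × IQR rule, which is an assumption made here for illustration and is not the rule seaborn's boxenplot uses to draw its points; the data, `flag_iqr_outliers`, and the threshold `k` are all hypothetical.

```python
import pandas as pd

# Made-up popularity values for two decades
toy = pd.DataFrame({
    "year_to_decade": ["1990s"] * 6 + ["2000s"] * 6,
    "popularity": [30, 32, 34, 36, 38, 90, 40, 42, 44, 46, 48, 0],
})

def flag_iqr_outliers(df, group_col, value_col, k=1.5):
    """Flag values more than k * IQR outside the quartiles of their own group."""
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's quartile back to every row of that group
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (df[value_col] < q1 - k * iqr) | (df[value_col] > q3 + k * iqr)

toy["is_outlier"] = flag_iqr_outliers(toy, "year_to_decade", "popularity")
print(toy[toy["is_outlier"]])
```

Because the quartiles are computed within each decade, a song is only flagged relative to its contemporaries, which matches how the plot groups the data.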


The average popularity of the entire dataset is calculated and compared to the average popularity of explicit songs and of clean songs. This helps assess whether explicitness is related to popularity.

In [9]:
# Average popularity over all songs (equivalent to sum divided by row count)
avg_pop = data['popularity'].mean()

avg_pop
Out[9]:
31.431794342906365
In [10]:
# New data frame with only explicit songs
df1 = data[data['explicit'] == 1]

# Average popularity of explicit songs
df1_avg_pop = df1['popularity'].mean()

df1_avg_pop
Out[10]:
45.186170581306726
In [11]:
# New data frame with only clean songs
df2 = data[data['explicit'] == 0]

# Average popularity of clean songs
df2_avg_pop = df2['popularity'].mean()

df2_avg_pop
Out[11]:
30.161042120087057

The average popularity of explicit songs (about 45.2) is considerably higher than that of both the entire dataset (about 31.4) and clean songs (about 30.2). However, this difference alone does not show that explicitness drives popularity, since other factors such as release year may be confounded with it. It does encourage further exploration.
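
The three averages can be obtained in one step with a group-by, and splitting by decade gives a first check on confounding. The sketch below runs on a deliberately constructed toy data frame (all values are made up) in which explicit songs look more popular overall but not within any decade; the variable names are illustrative only.

```python
import pandas as pd

# Made-up data: explicit songs are concentrated in the more popular decade
toy = pd.DataFrame({
    "explicit": [0, 0, 0, 1, 1, 0, 1, 1],
    "year_to_decade": ["1960s", "1960s", "2010s", "2010s",
                       "2010s", "1960s", "2010s", "1960s"],
    "popularity": [10, 20, 60, 50, 60, 30, 70, 20],
})

# Overall average and average per explicit flag (0 = clean, 1 = explicit)
overall_mean = toy["popularity"].mean()
by_explicit = toy.groupby("explicit")["popularity"].mean()

# Same comparison within each decade; rows are decades, columns the explicit flag
by_decade = (
    toy.groupby(["year_to_decade", "explicit"])["popularity"].mean().unstack()
)

print(overall_mean)
print(by_explicit)
print(by_decade)
```

In this toy example the gap between explicit and clean songs disappears once the decade is held fixed. Whether the same happens in the real data is an open question that the analogous group-by on `data` would answer.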


The pairplot below provides further visualizations, comparing sound features (and the other numeric columns in the dataset) with the popularity value, while encoding explicitness through the color of each data point. Blue represents clean music, while orange represents explicit music.

In [12]:
# Pairplot of popularity against every numeric column, on a random sample of 1000 songs
sns.pairplot(data.sample(1000), y_vars = ["popularity"], hue="explicit")
Out[12]:
<seaborn.axisgrid.PairGrid at 0x7ff9b8621910>

In every panel, the orange data points are mostly clustered in the top half of the graph, representing a greater popularity value. Explicit music also appears to have been produced more commonly after the 2000s. Furthermore, there seems to be an association between loud music and explicitness.
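
The two visual impressions above, the share of explicit songs over time and the link between loudness and explicitness, can be quantified with a cross-tabulation and a grouped mean. This is only a sketch on made-up values; `toy`, `explicit_share`, and `loudness_by_explicit` are illustrative names.

```python
import pandas as pd

# Made-up songs with decade, explicit flag, and loudness in dB
toy = pd.DataFrame({
    "year_to_decade": ["1990s"] * 4 + ["2010s"] * 4,
    "explicit": [0, 0, 0, 1, 0, 1, 1, 0],
    "loudness": [-12, -11, -13, -8, -9, -6, -5, -10],
})

# Row-normalized cross-tab: share of clean (0) and explicit (1) songs per decade
explicit_share = pd.crosstab(
    toy["year_to_decade"], toy["explicit"], normalize="index"
)

# Mean loudness for clean and explicit songs
loudness_by_explicit = toy.groupby("explicit")["loudness"].mean()

print(explicit_share)
print(loudness_by_explicit)
```

Loudness values closer to 0 dB are louder, so a higher mean for the explicit group would be consistent with the association seen in the pairplot.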


Read Genre Data¶

In [13]:
genre_data.head()
Out[13]:
mode genres acousticness danceability duration_ms energy instrumentalness liveness loudness speechiness tempo valence popularity key
0 1 21st century classical 0.979333 0.162883 1.602977e+05 0.071317 0.606834 0.361600 -31.514333 0.040567 75.336500 0.103783 27.833333 6
1 1 432hz 0.494780 0.299333 1.048887e+06 0.450678 0.477762 0.131000 -16.854000 0.076817 120.285667 0.221750 52.500000 5
2 1 8-bit 0.762000 0.712000 1.151770e+05 0.818000 0.876000 0.126000 -9.180000 0.047000 133.444000 0.975000 48.000000 7
3 1 [] 0.651417 0.529093 2.328809e+05 0.419146 0.205309 0.218696 -12.288965 0.107872 112.857352 0.513604 20.859882 7
4 1 a cappella 0.676557 0.538961 1.906285e+05 0.316434 0.003003 0.172254 -12.479387 0.082851 112.110362 0.448249 45.820071 7
In [14]:
genre_data.tail()
Out[14]:
mode genres acousticness danceability duration_ms energy instrumentalness liveness loudness speechiness tempo valence popularity key
2968 1 zolo 0.222625 0.547082 258099.064530 0.610240 0.143872 0.204206 -11.295878 0.061088 125.494919 0.596155 33.778943 9
2969 0 zouglou 0.161000 0.863000 206320.000000 0.909000 0.000000 0.108000 -5.985000 0.081300 119.038000 0.845000 58.000000 7
2970 1 zouk 0.263261 0.748889 306072.777778 0.622444 0.257227 0.089678 -10.289222 0.038778 101.965222 0.824111 46.666667 5
2971 0 zurich indie 0.993000 0.705667 198417.333333 0.172667 0.468633 0.179667 -11.453333 0.348667 91.278000 0.739000 0.000000 7
2972 1 zydeco 0.421038 0.629409 171671.690476 0.609369 0.019248 0.255877 -9.854825 0.050491 126.366087 0.808544 30.261905 7

Understanding the Data through Visualization¶

To understand whether the average song length of a genre is related to its popularity value, a bivariate histogram is used to visualize the relationship between the two.

In [15]:
# A new column is added where the duration is in the units of minutes, instead of milliseconds
genre_data['duration_in_min'] = genre_data['duration_ms'] / 60000

# Set the width and height of the figure
plt.figure(figsize=(14,10))

# Bivariate histogram that depicts the relationship between duration and popularity
sns.histplot(genre_data, x="duration_in_min", y = "popularity")
Out[15]:
<AxesSubplot:xlabel='duration_in_min', ylabel='popularity'>

The densest region of the graph, shown in dark blue, is where most genres lie: an average song length of approximately 4 minutes and a popularity value between 40 and 60. Among the most popular genres, most of those with a popularity value greater than 60 lie left of the 5 minute mark. On the other hand, all of the genres past roughly 6 minutes have a popularity value below 60. This suggests that shorter average song length is associated with higher popularity, although the plot alone does not establish a causal link.
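
A correlation coefficient gives a single-number summary of the pattern in the histogram. The sketch below computes Pearson and Spearman correlations on made-up genre values; the Spearman coefficient is obtained as the Pearson correlation of the ranks, and all names and numbers are assumptions for illustration.

```python
import pandas as pd

# Made-up genre averages: duration in milliseconds and popularity
toy = pd.DataFrame({
    "duration_ms": [120000, 180000, 240000, 300000, 420000, 600000],
    "popularity": [55, 62, 58, 40, 35, 20],
})
toy["duration_in_min"] = toy["duration_ms"] / 60000

# Pearson measures linear association
pearson = toy["duration_in_min"].corr(toy["popularity"])

# Spearman measures monotonic association: Pearson correlation of the ranks
spearman = toy["duration_in_min"].rank().corr(toy["popularity"].rank())

print(pearson, spearman)
```

A negative coefficient on `genre_data` would support the reading that longer genres tend to be less popular, while a value near zero would suggest the visual pattern is driven by a few extreme genres.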


To see what else could contribute to the popularity value of a genre, the most popular genres are analyzed by their acousticness, danceability, energy, instrumentalness, liveness, and valence through a grouped bar chart.

In [16]:
# The 15 genres with the highest popularity values are extracted into a new data frame
top_15 = genre_data.nlargest(15, 'popularity')
In [17]:
# Grouped bar plot of the 6 audio features for each of the top 15 genres by popularity
fig = px.bar(top_15, x='genres', y=['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'valence'], barmode='group')
fig.show()

The grouped bar chart allows us to visualize the audio features that may have contributed to the popularity of the top 15 genres. Danceability and energy are consistently high across these genres, which encourages a closer look at both features.

In [18]:
# Scatterplot with a regression line to assess the correlation between energy and popularity
sns.regplot(x=genre_data['energy'], y=genre_data['popularity'])
Out[18]:
<AxesSubplot:xlabel='energy', ylabel='popularity'>
In [19]:
# Scatterplot with a regression line to assess the correlation between danceability and popularity
sns.regplot(x=genre_data['danceability'], y=genre_data['popularity'])
Out[19]:
<AxesSubplot:xlabel='danceability', ylabel='popularity'>
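
The regression plots can be complemented by the correlation of each audio feature with popularity, which makes the features directly comparable. The sketch below uses `DataFrame.corrwith` on made-up values; the feature list and `toy` are illustrative and the numbers say nothing about the real dataset.

```python
import pandas as pd

features = ["energy", "danceability", "acousticness"]

# Made-up genre-level values, chosen so the expected ordering is easy to see
toy = pd.DataFrame({
    "popularity":   [10, 20, 30, 40, 50],
    "energy":       [0.1, 0.2, 0.3, 0.4, 0.5],
    "danceability": [0.5, 0.6, 0.4, 0.7, 0.6],
    "acousticness": [0.9, 0.7, 0.5, 0.3, 0.1],
})

# Pearson correlation of every feature column with popularity, strongest first
corr_with_pop = (
    toy[features].corrwith(toy["popularity"]).sort_values(ascending=False)
)
print(corr_with_pop)
```

Applied to `genre_data`, the same call would show whether energy and danceability really stand out among the six features plotted above.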

Read Year Data¶

In [20]:
year_data.head()
Out[20]:
mode year acousticness danceability duration_ms energy instrumentalness liveness loudness speechiness tempo valence popularity key
0 1 1921 0.886896 0.418597 260537.166667 0.231815 0.344878 0.205710 -17.048667 0.073662 101.531493 0.379327 0.653333 2
1 1 1922 0.938592 0.482042 165469.746479 0.237815 0.434195 0.240720 -19.275282 0.116655 100.884521 0.535549 0.140845 10
2 1 1923 0.957247 0.577341 177942.362162 0.262406 0.371733 0.227462 -14.129211 0.093949 114.010730 0.625492 5.389189 0
3 1 1924 0.940200 0.549894 191046.707627 0.344347 0.581701 0.235219 -14.231343 0.092089 120.689572 0.663725 0.661017 10
4 1 1925 0.962607 0.573863 184986.924460 0.278594 0.418297 0.237668 -14.146414 0.111918 115.521921 0.621929 2.604317 5
In [21]:
year_data.tail()
Out[21]:
mode year acousticness danceability duration_ms energy instrumentalness liveness loudness speechiness tempo valence popularity key
95 1 2016 0.284171 0.600202 221396.510295 0.592855 0.093984 0.181170 -8.061056 0.104313 118.652630 0.431532 59.647190 0
96 1 2017 0.286099 0.612217 211115.696787 0.590421 0.097091 0.191713 -8.312630 0.110536 117.202740 0.416476 63.263554 1
97 1 2018 0.267633 0.663500 206001.007133 0.602435 0.054217 0.176326 -7.168785 0.127176 121.922308 0.447921 63.296243 1
98 1 2019 0.278299 0.644814 201024.788096 0.593224 0.077640 0.172616 -7.722192 0.121043 120.235644 0.458818 65.256542 1
99 1 2020 0.219931 0.692904 193728.397537 0.631232 0.016376 0.178535 -6.595067 0.141384 124.283129 0.501048 64.301970 1
In [22]:
# Add the decade label and the duration in minutes to the yearly data
year_data['year_to_decade'] = year_data['year'].apply(decade_of_year)
year_data['duration_in_min'] = year_data['duration_ms'] / 60000
In [23]:
# Bivariate histogram of average song duration (in minutes) by decade
plt.figure(figsize=(14,10))
sns.histplot(data=year_data, x="year_to_decade", y = "duration_in_min")
Out[23]:
<AxesSubplot:xlabel='year_to_decade', ylabel='duration_in_min'>
In [24]:
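# Group the yearly data by decade; first() keeps the first row of each group,
# i.e. the earliest year in each decade, not a decade average
group = year_data.groupby(by = "year_to_decade")
group.first()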
Out[24]:
mode year acousticness danceability duration_ms energy instrumentalness liveness loudness speechiness tempo valence popularity key duration_in_min
year_to_decade
1920s 1 1921 0.886896 0.418597 260537.166667 0.231815 0.344878 0.205710 -17.048667 0.073662 101.531493 0.379327 0.653333 2 4.342286
1930s 1 1930 0.936715 0.518176 195150.285343 0.333524 0.352206 0.221311 -12.869221 0.119910 109.871194 0.616238 0.926715 2 3.252505
1940s 1 1940 0.847644 0.521892 182227.944500 0.310893 0.316849 0.264335 -13.684048 0.242958 108.449334 0.616709 0.930000 7 3.037132
1950s 1 1950 0.853941 0.504253 215073.125500 0.314071 0.245001 0.216958 -13.863834 0.153453 111.749725 0.551650 3.206500 7 3.584552
1960s 1 1960 0.767181 0.486029 210209.683784 0.341142 0.176502 0.207864 -13.814103 0.065784 112.561679 0.523932 19.783784 0 3.503495
1970s 1 1970 0.460057 0.506308 242852.151500 0.495633 0.127567 0.212269 -11.772558 0.051681 117.111610 0.572075 34.394500 2 4.047536
1980s 1 1980 0.284955 0.556152 252835.533333 0.597777 0.128751 0.203754 -10.700942 0.059249 122.985001 0.598058 36.206667 0 4.213926
1990s 1 1990 0.332870 0.535299 256451.403500 0.571591 0.125826 0.190961 -11.327479 0.064345 120.062734 0.526527 40.785500 7 4.274190
2000s 1 2000 0.289323 0.590918 242724.642638 0.625413 0.101168 0.197686 -8.247766 0.089205 118.999323 0.559475 46.684049 7 4.045411
2010s 1 2010 0.242687 0.572488 242811.804563 0.681778 0.082981 0.199701 -6.909904 0.081031 123.570215 0.520895 52.730159 0 4.046863
2020s 1 2020 0.219931 0.692904 193728.397537 0.631232 0.016376 0.178535 -6.595067 0.141384 124.283129 0.501048 64.301970 1 3.228807
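
Note that the table above shows one representative year per decade (1921, 1930, 1940, and so on), because `first()` returns the first row of each group. If a decade-level average is what is wanted, `mean()` is the usual alternative. The sketch below contrasts the two on made-up yearly values; it is not a claim about which one the analysis intended.

```python
import pandas as pd

# Made-up yearly popularity averages
toy = pd.DataFrame({
    "year": [1990, 1991, 1992, 2000, 2001],
    "popularity": [40, 42, 44, 46, 50],
})
toy["year_to_decade"] = (toy["year"] // 10 * 10).astype(str) + "s"

grouped = toy.groupby("year_to_decade")["popularity"]

# Value of the first year in each decade
first_year = grouped.first()

# Average over all years in each decade
decade_mean = grouped.mean()

print(first_year)
print(decade_mean)
```

The two summaries differ whenever a feature drifts within a decade, so the choice affects any chart built on top of the grouped table.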
In [25]:
sorted = group.first().sort_values(by = "popularity")
sorted
Out[25]:
mode year acousticness danceability duration_ms energy instrumentalness liveness loudness speechiness tempo valence popularity key duration_in_min
year_to_decade
1920s 1 1921 0.886896 0.418597 260537.166667 0.231815 0.344878 0.205710 -17.048667 0.073662 101.531493 0.379327 0.653333 2 4.342286
1930s 1 1930 0.936715 0.518176 195150.285343 0.333524 0.352206 0.221311 -12.869221 0.119910 109.871194 0.616238 0.926715 2 3.252505
1940s 1 1940 0.847644 0.521892 182227.944500 0.310893 0.316849 0.264335 -13.684048 0.242958 108.449334 0.616709 0.930000 7 3.037132
1950s 1 1950 0.853941 0.504253 215073.125500 0.314071 0.245001 0.216958 -13.863834 0.153453 111.749725 0.551650 3.206500 7 3.584552
1960s 1 1960 0.767181 0.486029 210209.683784 0.341142 0.176502 0.207864 -13.814103 0.065784 112.561679 0.523932 19.783784 0 3.503495
1970s 1 1970 0.460057 0.506308 242852.151500 0.495633 0.127567 0.212269 -11.772558 0.051681 117.111610 0.572075 34.394500 2 4.047536
1980s 1 1980 0.284955 0.556152 252835.533333 0.597777 0.128751 0.203754 -10.700942 0.059249 122.985001 0.598058 36.206667 0 4.213926
1990s 1 1990 0.332870 0.535299 256451.403500 0.571591 0.125826 0.190961 -11.327479 0.064345 120.062734 0.526527 40.785500 7 4.274190
2000s 1 2000 0.289323 0.590918 242724.642638 0.625413 0.101168 0.197686 -8.247766 0.089205 118.999323 0.559475 46.684049 7 4.045411
2010s 1 2010 0.242687 0.572488 242811.804563 0.681778 0.082981 0.199701 -6.909904 0.081031 123.570215 0.520895 52.730159 0 4.046863
2020s 1 2020 0.219931 0.692904 193728.397537 0.631232 0.016376 0.178535 -6.595067 0.141384 124.283129 0.501048 64.301970 1 3.228807
In [26]:
fig = px.bar(sorted, y=['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'valence'], barmode='group')
fig.show()
In [27]:
# 2 more visualizations